{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright (c) Microsoft Corporation. All rights reserved.  \n",
    "Licensed under the MIT License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Inference PyTorch GPT2 Model with ONNX Runtime on CPU\n",
    "\n",
    "In this tutorial, you'll be introduced to how to load a GPT2 model from PyTorch, convert it to ONNX, and inference it using ONNX Runtime using IO Binding. Note that past state is used to get better performance."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prerequisites ##\n",
    "\n",
    "If you have Jupyter Notebook, you may directly run this notebook. We will use pip to install or upgrade [PyTorch](https://pytorch.org/), [OnnxRuntime](https://microsoft.github.io/onnxruntime/) and other required packages.\n",
    "\n",
    "Otherwise, you can setup a new environment. First, we install [AnaConda](https://www.anaconda.com/distribution/). Then open an AnaConda prompt window and run the following commands:\n",
    "\n",
    "```console\n",
    "conda create -n cpu_env python=3.10\n",
    "conda activate cpu_env\n",
    "pip install jupyterlab\n",
    "conda install ipykernel\n",
    "conda install -c conda-forge ipywidgets\n",
    "ipython kernel install --user --name cpu_env\n",
    "jupyter-lab\n",
    "```\n",
    "The last command will launch JupyterLab, then we can open this notebook and select kernel cpu_env to run it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Please refer to https://pytorch.org to install CPU-only PyTorch\n",
    "import sys\n",
    "if sys.platform in [\"darwin\", \"win32\"]:  # Mac or Windows\n",
    "    !{sys.executable} -m pip install torch -q\n",
    "else:\n",
    "    !{sys.executable} -m pip install install torch --index-url https://download.pytorch.org/whl/cpu -q\n",
    "\n",
    "!{sys.executable} -m pip install onnxruntime transformers==4.18 onnx psutil pandas py-cpuinfo py3nvml netron coloredlogs --no-warn-script-location -q"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "if sys.platform in [\"win32\"]:\n",
    "    os.environ[\"HF_HUB_DISABLE_SYMLINKS_WARNING\"] = \"1\"\n",
    "\n",
    "# Create a cache directory to store pretrained model.\n",
    "cache_dir = os.path.join(\".\", \"cache_models\")\n",
    "if not os.path.exists(cache_dir):\n",
    "    os.makedirs(cache_dir)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Convert GPT2 model from PyTorch to ONNX ##\n",
    "\n",
    "We have a script [convert_to_onnx.py](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/models/gpt2/convert_to_onnx.py) that could help you to convert GPT2 with past state to ONNX. \n",
    "\n",
    "The script accepts a pretrained model name or path of a checkpoint directory as input, and converts the model to ONNX. It also verifies that the ONNX model could generate same input as the pytorch model. The usage is like \n",
    "```\n",
    "python -m onnxruntime.transformers.models.gpt2.convert_to_onnx -m model_name_or_path --output gpt2.onnx -o -p fp32\n",
    "python -m onnxruntime.transformers.models.gpt2.convert_to_onnx -m model_name_or_path --output gpt2.onnx -o -p fp16 --auto_mixed_precision\n",
    "```\n",
    "The -p option can be used to choose the precision: fp32 (float32), fp16 (mixed precision) or int8 (quantization). The -o option will generate optimized model, which is required for fp16 or int8. Mixed precision model by --auto_mixed_precision is recommended for GPU inference. For CPU inference, fp32 model is recommended since int8 model might have large accuracy loss.\n",
    "\n",
    "Here we use a pretrained model as example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "GPT2Config {\n",
      "  \"_name_or_path\": \"gpt2\",\n",
      "  \"activation_function\": \"gelu_new\",\n",
      "  \"architectures\": [\n",
      "    \"GPT2LMHeadModel\"\n",
      "  ],\n",
      "  \"attn_pdrop\": 0.1,\n",
      "  \"bos_token_id\": 50256,\n",
      "  \"embd_pdrop\": 0.1,\n",
      "  \"eos_token_id\": 50256,\n",
      "  \"initializer_range\": 0.02,\n",
      "  \"layer_norm_epsilon\": 1e-05,\n",
      "  \"model_type\": \"gpt2\",\n",
      "  \"n_ctx\": 1024,\n",
      "  \"n_embd\": 768,\n",
      "  \"n_head\": 12,\n",
      "  \"n_inner\": null,\n",
      "  \"n_layer\": 12,\n",
      "  \"n_positions\": 1024,\n",
      "  \"reorder_and_upcast_attn\": false,\n",
      "  \"resid_pdrop\": 0.1,\n",
      "  \"scale_attn_by_inverse_layer_idx\": false,\n",
      "  \"scale_attn_weights\": true,\n",
      "  \"summary_activation\": null,\n",
      "  \"summary_first_dropout\": 0.1,\n",
      "  \"summary_proj_to_labels\": true,\n",
      "  \"summary_type\": \"cls_index\",\n",
      "  \"summary_use_proj\": true,\n",
      "  \"task_specific_params\": {\n",
      "    \"text-generation\": {\n",
      "      \"do_sample\": true,\n",
      "      \"max_length\": 50\n",
      "    }\n",
      "  },\n",
      "  \"transformers_version\": \"4.18.0\",\n",
      "  \"use_cache\": true,\n",
      "  \"vocab_size\": 50257\n",
      "}\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from onnxruntime.transformers.models.gpt2.gpt2_helper import Gpt2Helper, MyGPT2LMHeadModel\n",
    "from transformers import AutoConfig\n",
    "import torch\n",
    "\n",
    "model_name_or_path = \"gpt2\"\n",
    "config = AutoConfig.from_pretrained(model_name_or_path, cache_dir=cache_dir)\n",
    "model = MyGPT2LMHeadModel.from_pretrained(model_name_or_path, config=config, cache_dir=cache_dir)\n",
    "device = torch.device(\"cpu\")\n",
    "model.eval().to(device)\n",
    "\n",
    "print(model.config)\n",
    "\n",
    "num_attention_heads = model.config.n_head\n",
    "hidden_size = model.config.n_embd\n",
    "num_layer = model.config.n_layer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "onnx_model_path = \"gpt2.onnx\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "!{sys.executable} -m onnxruntime.transformers.models.gpt2.convert_to_onnx -m $model_name_or_path --output $onnx_model_path -o -p fp32 -t 10 >export_output.txt 2>&1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Please pay attention to the optimized operators in the output. Counters of EmbedLayerNormalization, Attention, FastGelu and SkipLayerNormalization shall be positive for fully optimized GPT-2 model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Optimized operators:{'EmbedLayerNormalization': 1, 'Attention': 12, 'MultiHeadAttention': 0, 'Gelu': 0, 'FastGelu': 12, 'BiasGelu': 0, 'GemmFastGelu': 0, 'LayerNormalization': 0, 'SkipLayerNormalization': 24, 'QOrderedAttention': 0, 'QOrderedGelu': 0, 'QOrderedLayerNormalization': 0, 'QOrderedMatMul': 0}\n",
      "\n"
     ]
    }
   ],
   "source": [
    "file = open(\"export_output.txt\", \"r\")\n",
    "for line in file.readlines():\n",
    "    if \"Optimized operators\" in line:\n",
    "        print(line)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## PyTorch Inference using Huggingface Transformers ##\n",
    "\n",
    "In the following, we will use an example input to get the output from PyTorch for comparison purpose.\n",
    "For the first inference, there is no any past state. We can prepare empty state for input."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "input_ids tensor([[50256, 50256, 50256, 50256, 13466,  7541,   287, 15489,  1989],\n",
      "        [ 1456,   318,   281,  1672,   286,   308,   457,    17,  2746]],\n",
      "       dtype=torch.int32)\n",
      "attention_mask tensor([[0, 0, 0, 0, 1, 1, 1, 1, 1],\n",
      "        [1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=torch.int32)\n",
      "position_ids tensor([[0, 0, 0, 0, 0, 1, 2, 3, 4],\n",
      "        [0, 1, 2, 3, 4, 5, 6, 7, 8]], dtype=torch.int32)\n"
     ]
    }
   ],
   "source": [
    "from transformers import AutoTokenizer\n",
    "\n",
    "EXAMPLE_Text = [\"best hotel in bay area\", \"here is an example of gpt2 model\"]\n",
    "\n",
    "\n",
    "def get_tokenizer(model_name_or_path, cache_dir):\n",
    "    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, cache_dir=cache_dir)\n",
    "    tokenizer.padding_side = \"left\"\n",
    "    tokenizer.pad_token = tokenizer.eos_token\n",
    "    return tokenizer\n",
    "\n",
    "\n",
    "def get_example_inputs(prompt_text=EXAMPLE_Text):\n",
    "    tokenizer = get_tokenizer(model_name_or_path, cache_dir)\n",
    "    encodings_dict = tokenizer.batch_encode_plus(prompt_text, padding=True)\n",
    "\n",
    "    input_ids = torch.tensor(encodings_dict[\"input_ids\"], dtype=torch.int32)\n",
    "    attention_mask = torch.tensor(encodings_dict[\"attention_mask\"], dtype=torch.int32)\n",
    "    position_ids = attention_mask.long().cumsum(-1) - 1\n",
    "    position_ids.masked_fill_(position_ids < 0, 0)\n",
    "    position_ids = position_ids.to(torch.int32)\n",
    "\n",
    "    # Empty Past State for generating first word\n",
    "    empty_past = []\n",
    "    batch_size = input_ids.size(0)\n",
    "    sequence_length = input_ids.size(1)\n",
    "    past_shape = [2, batch_size, num_attention_heads, 0, hidden_size // num_attention_heads]\n",
    "    for i in range(num_layer):\n",
    "        empty_past.append(torch.empty(past_shape).type(torch.float32).to(device))\n",
    "\n",
    "    return input_ids, attention_mask, position_ids, empty_past\n",
    "\n",
    "\n",
    "from transformers import GPT2LMHeadModel\n",
    "\n",
    "torch_model = GPT2LMHeadModel.from_pretrained(model_name_or_path, config=config, cache_dir=cache_dir)\n",
    "device = torch.device(\"cpu\")\n",
    "torch_model.eval().to(device)\n",
    "\n",
    "input_ids, attention_mask, position_ids, empty_past = get_example_inputs()\n",
    "print(\"input_ids\", input_ids)\n",
    "print(\"attention_mask\", attention_mask)\n",
    "print(\"position_ids\", position_ids)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "with torch.no_grad():\n",
    "    torch_output = torch_model(\n",
    "        input_ids, past_key_values=empty_past, attention_mask=attention_mask, position_ids=position_ids\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## ONNX Runtime Inference ##\n",
    "\n",
    "We can use ONNX Runtime to inference. The inputs are dictionary with name and numpy array as value, and the output is list of numpy array. Note that both input and output are in CPU. When you run the inference in GPU, it will involve data copy between CPU and GPU for input and output.\n",
    "\n",
    "Let's create an inference session for ONNX Runtime given the exported ONNX model, and see the output."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "import onnxruntime\n",
    "import numpy\n",
    "\n",
    "input_ids, attention_mask, position_ids, empty_past = get_example_inputs()\n",
    "\n",
    "session = onnxruntime.InferenceSession(onnx_model_path, providers=[\"CPUExecutionProvider\"])\n",
    "ort_inputs = {\n",
    "    \"input_ids\": numpy.ascontiguousarray(input_ids.cpu().numpy()),\n",
    "    \"attention_mask\": numpy.ascontiguousarray(attention_mask.cpu().numpy()),\n",
    "    \"position_ids\": numpy.ascontiguousarray(position_ids.cpu().numpy()),\n",
    "}\n",
    "for i, past_i in enumerate(empty_past):\n",
    "    ort_inputs[f\"past_{i}\"] = numpy.ascontiguousarray(past_i.cpu().numpy())\n",
    "ort_outputs = session.run(None, ort_inputs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can compare the outputs from PyTorch and ONNX Runtime. Logits are very close."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "max logits diff (ignored padding) tensor(7.6294e-05)\n"
     ]
    }
   ],
   "source": [
    "logits_masked_diff = (torch_output[0] - ort_outputs[0]) * attention_mask.unsqueeze(2)\n",
    "max_logits_diff = logits_masked_diff.abs().max()\n",
    "print(\"max logits diff (ignored padding)\", max_logits_diff)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## ONNX Runtime Inference with IO Binding ##\n",
    "\n",
    "To avoid data copy for input and output, ONNX Runtime also supports IO Binding. User could provide some buffer for input and outputs. For GPU inference, the buffer can be in GPU to reduce memory copy between CPU and GPU. This is helpful for high performance inference in GPU. For GPT-2, IO Binding might help the performance when batch size or (past) sequence length is large."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import List, Dict\n",
    "from onnxruntime import InferenceSession\n",
    "\n",
    "from onnxruntime.transformers.io_binding_helper import TypeHelper\n",
    "from onnxruntime.transformers.io_binding_helper import IOBindingHelper\n",
    "\n",
    "\n",
    "def inference_with_io_binding(session, config, input_ids, position_ids, attention_mask, past):\n",
    "    output_shapes = Gpt2Helper.get_output_shapes(\n",
    "        batch_size=input_ids.size(0),\n",
    "        past_sequence_length=past[0].size(3),\n",
    "        sequence_length=input_ids.size(1),\n",
    "        config=config,\n",
    "    )\n",
    "    output_buffers = Gpt2Helper.get_output_buffers(output_shapes, device)\n",
    "\n",
    "    io_binding = IOBindingHelper.prepare_io_binding(\n",
    "        session, input_ids, position_ids, attention_mask, past, output_buffers, output_shapes\n",
    "    )\n",
    "    session.run_with_iobinding(io_binding)\n",
    "\n",
    "    outputs = Gpt2Helper.get_outputs_from_io_binding_buffer(session, output_buffers, output_shapes, return_numpy=False)\n",
    "    return outputs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that the result is exactly same with/without IO Binding:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "IO Binding result is good\n"
     ]
    }
   ],
   "source": [
    "input_ids, attention_mask, position_ids, empty_past = get_example_inputs()\n",
    "outputs = inference_with_io_binding(session, config, input_ids, position_ids, attention_mask, empty_past)\n",
    "for i in range(len(outputs)):\n",
    "    assert torch.eq(outputs[i], torch.from_numpy(ort_outputs[i])).all()\n",
    "print(\"IO Binding result is good\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Batch Text Generation ##\n",
    "\n",
    "Here is an example for text generation using ONNX Runtime or PyTorch. For ONNX Runtime, IO Binding is used for better performance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "def test_generation(tokenizer, input_text, ort_session=None, num_tokens_to_produce=30):\n",
    "    assert len(input_text) == 1  # This function requires batch_size==1\n",
    "    use_onnxruntime = ort_session is not None\n",
    "    print(\"Text generation using\", \"OnnxRuntime\" if use_onnxruntime else \"PyTorch\", \"...\")\n",
    "    eos_token_id = tokenizer.eos_token_id\n",
    "\n",
    "    input_ids, attention_mask, position_ids, past = get_example_inputs(input_text)\n",
    "    batch_size = input_ids.size(0)\n",
    "\n",
    "    has_eos = torch.zeros(batch_size, dtype=torch.bool)\n",
    "\n",
    "    all_token_ids = input_ids.clone()\n",
    "\n",
    "    for step in range(num_tokens_to_produce):\n",
    "        if ort_session is not None:\n",
    "            outputs = inference_with_io_binding(ort_session, config, input_ids, position_ids, attention_mask, past)\n",
    "        else:\n",
    "            outputs = torch_model(\n",
    "                input_ids, attention_mask=attention_mask, position_ids=position_ids, past_key_values=past\n",
    "            )\n",
    "\n",
    "        next_token_logits = outputs[0][:, -1, :]\n",
    "        # Greedy approach is used here. You can easily extend it to use beam search and sampling to pick next tokens.\n",
    "        next_tokens = torch.argmax(next_token_logits, dim=-1)\n",
    "\n",
    "        has_eos = has_eos | (next_tokens == eos_token_id)\n",
    "        tokens_to_add = next_tokens.masked_fill(has_eos, eos_token_id)\n",
    "        all_token_ids = torch.cat([all_token_ids, tokens_to_add.unsqueeze(-1)], dim=-1)\n",
    "\n",
    "        # Update input_ids, attention_mask, position_ids and past\n",
    "        input_ids = tokens_to_add.clone().detach().reshape([batch_size, 1]).to(device)\n",
    "        position_ids = (position_ids[:, -1] + 1).reshape(batch_size, 1)\n",
    "        attention_mask = torch.cat([attention_mask, torch.ones([batch_size, 1]).type_as(attention_mask)], 1).to(device)\n",
    "\n",
    "        past = []\n",
    "        if not use_onnxruntime:\n",
    "            past = list(outputs[1])  # past in torch output is tuple\n",
    "        else:\n",
    "            for i in range(num_layer):\n",
    "                past_i = (\n",
    "                    torch.from_numpy(outputs[i + 1])\n",
    "                    if isinstance(outputs[i + 1], numpy.ndarray)\n",
    "                    else outputs[i + 1].clone().detach()\n",
    "                )\n",
    "                past.append(past_i.to(device))\n",
    "\n",
    "        if torch.all(has_eos):\n",
    "            break\n",
    "\n",
    "    for i, output in enumerate(all_token_ids):\n",
    "        print(\"------------\")\n",
    "        print(tokenizer.decode(output, skip_special_tokens=True))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Text generation using OnnxRuntime ...\n",
      "------------\n",
      "best hotel in bay area.\n",
      "\n",
      "The hotel is located in the historic Bayview neighborhood of San Francisco.\n",
      "\n",
      "The hotel is open daily from 9 a.m.\n"
     ]
    }
   ],
   "source": [
    "tokenizer = get_tokenizer(model_name_or_path, cache_dir)\n",
    "input_text = EXAMPLE_Text[:1]\n",
    "test_generation(tokenizer, input_text, ort_session=session)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we use PyTorch to run again and we can see that the result is exactly same."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Text generation using PyTorch ...\n",
      "------------\n",
      "best hotel in bay area.\n",
      "\n",
      "The hotel is located in the historic Bayview neighborhood of San Francisco.\n",
      "\n",
      "The hotel is open daily from 9 a.m.\n"
     ]
    }
   ],
   "source": [
    "test_generation(tokenizer, input_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Benchmark ##\n",
    "There is a tool benchmark_gpt2.py, which can be used to measure the performance of GPT-2 by PyTorch, ONNX Runtime without/with IO Binding."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "!{sys.executable} -m onnxruntime.transformers.models.gpt2.benchmark_gpt2 -m gpt2 -o -b 1 ----sequence_lengths 1 --past_sequence_lengths 128 >benchmark_output.txt 2>&1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "batch_size=1, sequence_length=1, past_sequence_length=8, onnxruntime_latency=21.33  \n",
      "\n",
      "batch_size=1, sequence_length=1, past_sequence_length=16, onnxruntime_latency=22.88  \n",
      "\n",
      "batch_size=1, sequence_length=1, past_sequence_length=32, onnxruntime_latency=22.81  \n",
      "\n",
      "batch_size=1, sequence_length=1, past_sequence_length=64, onnxruntime_latency=24.01  \n",
      "\n",
      "batch_size=1, sequence_length=1, past_sequence_length=128, onnxruntime_latency=22.87  \n",
      "\n",
      "batch_size=1, sequence_length=1, past_sequence_length=256, onnxruntime_latency=25.30  \n",
      "\n"
     ]
    }
   ],
   "source": [
    "file = open(\"benchmark_output.txt\", \"r\")\n",
    "for line in file.readlines():\n",
    "    if \"onnxruntime_latency\" in line:\n",
    "        print(line)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## ONNX Model for Generation ##\n",
    "There is a tool [convert_generation.py](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/convert_generation.py), which can convert GPT-2 model to enable Beam Search, Greedy Search or Sampling in one ONNX model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "!{sys.executable} -m onnxruntime.transformers.convert_generation -m gpt2 --output gpt2_beam_search.onnx >convert_generation_output.txt 2>&1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Quantization ##\n",
    "For large language model, we can quantize the model to int8 to run efficiently in CPU. See examples of [GPT quantization](https://github.com/microsoft/onnxruntime-inference-examples/tree/main/quantization/language_model/gpt2) and [LLaMA quantization](https://github.com/microsoft/onnxruntime-inference-examples/tree/main/quantization/language_model/llama)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Test Environment ###\n",
    "The following is the hardware of the test machine, and software version:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"gpu\": {\n",
      "    \"driver_version\": \"472.88\",\n",
      "    \"devices\": [\n",
      "      {\n",
      "        \"memory_total\": 12884901888,\n",
      "        \"memory_available\": 12732858368,\n",
      "        \"name\": \"NVIDIA GeForce RTX 3060\"\n",
      "      }\n",
      "    ]\n",
      "  },\n",
      "  \"cpu\": {\n",
      "    \"brand\": \"Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz\",\n",
      "    \"cores\": 6,\n",
      "    \"logical_cores\": 12,\n",
      "    \"hz\": \"3192000000,0\",\n",
      "    \"l2_cache\": 1572864,\n",
      "    \"flags\": \"3dnow,3dnowprefetch,abm,acpi,adx,aes,apic,avx,avx2,bmi1,bmi2,clflush,clflushopt,cmov,cx16,cx8,de,dtes64,dts,erms,est,f16c,fma,fpu,fxsr,hle,ht,hypervisor,ia64,invpcid,lahf_lm,mca,mce,mmx,monitor,movbe,mpx,msr,mtrr,osxsave,pae,pat,pbe,pcid,pclmulqdq,pdcm,pge,pni,popcnt,pse,pse36,rdrnd,rdseed,rtm,sep,serial,sgx,sgx_lc,smap,smep,ss,sse,sse2,sse4_1,sse4_2,ssse3,tm,tm2,tsc,tscdeadline,vme,x2apic,xsave,xtpr\",\n",
      "    \"processor\": \"Intel64 Family 6 Model 158 Stepping 10, GenuineIntel\"\n",
      "  },\n",
      "  \"memory\": {\n",
      "    \"total\": 16977195008,\n",
      "    \"available\": 9550102528\n",
      "  },\n",
      "  \"os\": \"Windows-10-10.0.22621-SP0\",\n",
      "  \"python\": \"3.10.12.final.0 (64 bit)\",\n",
      "  \"packages\": {\n",
      "    \"flatbuffers\": \"23.5.26\",\n",
      "    \"numpy\": \"1.25.2\",\n",
      "    \"onnx\": \"1.14.0\",\n",
      "    \"onnxruntime\": \"1.15.1\",\n",
      "    \"protobuf\": \"4.23.4\",\n",
      "    \"sympy\": \"1.12\",\n",
      "    \"torch\": \"2.0.1\",\n",
      "    \"transformers\": \"4.18.0\"\n",
      "  },\n",
      "  \"onnxruntime\": {\n",
      "    \"version\": \"1.15.1\",\n",
      "    \"support_gpu\": false\n",
      "  },\n",
      "  \"pytorch\": {\n",
      "    \"version\": \"2.0.1+cpu\",\n",
      "    \"support_gpu\": false,\n",
      "    \"cuda\": null\n",
      "  },\n",
      "  \"tensorflow\": null\n",
      "}\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "f:\\anaconda3\\envs\\cpu_env\\lib\\site-packages\\onnxruntime\\transformers\\machine_info.py:127: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html\n",
      "  import pkg_resources\n"
     ]
    }
   ],
   "source": [
    "!{sys.executable} -m onnxruntime.transformers.machine_info --silent"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "cpu_env",
   "language": "python",
   "name": "cpu_env"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
